Qualcomm AI Engine Direct - calibration thread auto-tuning#18184
abhinaykukkadapu merged 1 commit into pytorch:main
Conversation
CI status as of commit bef50da (merge base eb92cec): 1 new failure, 3 pending, 2 unrelated failures (1 flaky, 1 broken trunk). Rebasing onto the `viable/strict` branch avoids the unrelated failures.
Force-pushed 3b02283 to 02f6db3.
```python
original_threads = torch.get_num_threads()
torch.set_num_threads(calib_threads)
```
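For context, a hedged sketch of how that save/restore could be scoped so the original thread count is restored even if calibration raises (`run_with_calib_threads` is an illustrative name, not the PR's actual helper):

```python
import torch

def run_with_calib_threads(calib_fn, calib_threads):
    """Run calib_fn with a temporary intra-op thread count, then restore."""
    original_threads = torch.get_num_threads()
    torch.set_num_threads(calib_threads)
    try:
        return calib_fn()
    finally:
        # Restore so later stages (export, QNN compile) are unaffected.
        torch.set_num_threads(original_threads)
```

The `try`/`finally` matters here: without it, an exception during decode calibration would leave the process pinned to the tuned thread count.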
What does this actually do, and what does it mean? How does it differ between CPU and GPU? Can we still use the GPU to calibrate?
I checked a bit more, and this is what Claude said:
PyTorch uses a heuristic that depends on the environment:
- Locally / outside containers: It typically defaults to the number of logical CPU cores (os.cpu_count()), which counts hyperthreaded cores.
- In containers / limited environments (like Docker with CPU limits, Kubernetes, or certain cloud VMs): PyTorch tries to respect CPU affinity and cgroup limits, so the thread count may be lower.
- With OpenMP: If PyTorch is compiled with OpenMP (common on Linux), the thread count may be governed by OMP_NUM_THREADS, which, if unset, OpenMP often sets to the logical core count.
It seems this is specific to PyTorch builds with OpenMP.
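To see which of these heuristics applies on a given host, the relevant values can be inspected directly (all standard PyTorch/stdlib calls; the print labels are just illustrative):

```python
import os
import torch

# What PyTorch actually chose for intra-op parallelism:
print("torch intra-op threads:", torch.get_num_threads())

# Logical core count (hyperthreads included) -- the usual local default:
print("os.cpu_count():", os.cpu_count())

# Cores this process may actually run on (respects taskset/cgroup cpusets);
# sched_getaffinity is Linux-only, hence the guard:
if hasattr(os, "sched_getaffinity"):
    print("affinity cores:", len(os.sched_getaffinity(0)))

# OpenMP override, if the user set one:
print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))
```

Note that `os.cpu_count()` ignores cgroup limits, which is why the affinity check matters inside containers.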
Curious what the Qualcomm folks' setup is. @haowhsu-quic
Yeah, in my experiments the high per-iteration time is due to threads waiting at the barrier (you can see the large pillar, named `mkl_blas_sgemv`, in the flamegraph from the linked GitHub issues). This is a matrix-vector multiply, specific to decode; since the conv2d kernel workloads are smaller there, PyTorch seems to default to high thread counts assuming larger workloads.
@haowhsu-quic can you please pull this PR on top of main (I just merged my coarse + fine PR) and see if tuning works on other VMs?
How about GPU? Does it make a difference?
Thanks. It looks like the initial thread count also uses `mkl_get_max_threads`. Not quite sure how they're different...
Regarding GPU, I think the GPU logic is shared with CPU? We can also do `model.to("cuda")` and the rest if needed, and it goes through the same path. I ran this path a while ago; unsure if it still works. Just trying to reduce the burden of using the GPU to calibrate the model here.
> Not quite sure how they're different
We tune the workload on the host across various thread counts to find the optimal number of threads.
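As a rough illustration of that tuning loop (not the PR's actual code; `pick_best_threads` and the SGEMV-like workload below are hypothetical), the idea is to time a representative matrix-vector multiply at each candidate thread count and keep the fastest:

```python
import time
import torch

def pick_best_threads(candidates, work, repeats=3):
    """Time `work()` at each candidate thread count; return the fastest."""
    original = torch.get_num_threads()
    best, best_t = original, float("inf")
    try:
        for n in candidates:
            torch.set_num_threads(n)
            work()  # warm-up run, excluded from timing
            t0 = time.perf_counter()
            for _ in range(repeats):
                work()
            elapsed = time.perf_counter() - t0
            if elapsed < best_t:
                best, best_t = n, elapsed
    finally:
        torch.set_num_threads(original)
    return best

# SGEMV-like decode workload: matrix-vector multiply.
A = torch.randn(2048, 2048)
x = torch.randn(2048)
best = pick_best_threads([1, 2, 4], lambda: A @ x)
```

On small matrix-vector workloads like this, fewer threads often win because the OpenMP barrier cost dominates the actual compute.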
> Yeah, in my experiments the high per-iteration time is due to threads waiting at the barrier (you can see the large pillar, named `mkl_blas_sgemv`, in the flamegraph from the linked GitHub issues). This is a matrix-vector multiply, specific to decode; since the conv2d kernel workloads are smaller there, PyTorch seems to default to high thread counts assuming larger workloads. @haowhsu-quic can you please pull this PR on top of main (I just merged my coarse + fine PR) and see if tuning works on other VMs?
Will test over the weekend (other ongoing tasks are occupying my machine); please wait a few days.
Hi @abhinaykukkadapu, I've tested with 2 machines:
Intel(R) Core(TM) i7-14700 with 28 cores:
| Model | Best Cores | Calibration Time (Hybrid Mode) |
|---|---|---|
| smollm2_135m | 8 | 313s |
| qwen2_5-0_5b | 8 | 289s |
| qwen2_5-1_5b | 20 | 965s |
| qwen3-0_6b | 8 | 1058s |
| gemma3-1b | 20 | 668s |
| smollm3-3b | 20 | 1659s |
| llama3_2-1b_instruct | 20 | 3043s |
| llama3_2-3b_instruct | 20 | 1707s |
(VM) AMD EPYC 7H12 with 16 cores:
| Model | Best Cores | Calibration Time (Hybrid Mode) |
|---|---|---|
| smollm2_135m | 11 | FP Exception |
| qwen2_5-0_5b | 16 | FP Exception |
| qwen2_5-1_5b | 12 | FP Exception |
| qwen3-0_6b | 16 | FP Exception |
| gemma3-1b | 12 | FP Exception |
| smollm3-3b | 8 | FP Exception |
| llama3_2-1b_instruct | 16 | FP Exception |
| llama3_2-3b_instruct | 12 | FP Exception |
The AMD processor hits an issue in recent PyTorch versions, which is addressed in #18098.
@haowhsu-quic thanks for testing it. Can you please stamp if you don't have any concerns? Thanks.
haowhsu-quic left a comment:
Impressive, thank you.
AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound. The default thread count (os.cpu_count()) causes massive OpenMP sync overhead on multi-core hosts. Add runtime auto-tuning that samples fractions of the available thread ceiling (1/8 through 1.0) via a quick microbenchmark before prepare_pt2e — no observers exist yet, so synthetic benchmark inputs cannot pollute calibration state. Uses sched_getaffinity when available to respect cgroup/taskset constraints. Thread count is scoped to calibration only and restored after decode calibration phase. CLI override via --calibration_num_threads (0 = auto-tune, default). On a 72-vCPU host, auto-tune selects 18-36 threads depending on the workload, yielding 10.1x faster calibration (21.8 min vs 3h40m) with no PPL regression.
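The candidate-generation step described above (fractions 1/8 through 1.0 of an affinity-aware thread ceiling) could look roughly like this; `candidate_thread_counts` is an illustrative sketch under those stated assumptions, not the PR's exact implementation:

```python
import os

def candidate_thread_counts():
    """Sample fractions (1/8 .. 1.0) of the available thread ceiling.

    Uses sched_getaffinity where available so cgroup/taskset limits
    are respected; falls back to os.cpu_count() elsewhere.
    """
    if hasattr(os, "sched_getaffinity"):
        ceiling = len(os.sched_getaffinity(0))
    else:
        ceiling = os.cpu_count() or 1
    fractions = (1 / 8, 1 / 4, 1 / 2, 3 / 4, 1.0)
    seen, out = set(), []
    for f in fractions:
        n = max(1, int(ceiling * f))  # never drop below one thread
        if n not in seen:
            seen.add(n)
            out.append(n)
    return out
```

Running the microbenchmark before `prepare_pt2e` is the key ordering constraint: no observers exist yet, so the synthetic benchmark inputs cannot pollute calibration statistics.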
Force-pushed 02f6db3 to bef50da.
TL;DR
Overall calibration time has been cut to roughly 10-25 minutes, down from the previous ~2.5 h, for various models (a ~10x speedup for the decode phase). These optimizations are the stacked results of multiple commits. The only remaining bottleneck is the QNN SDK compile step, which is opaque to us.
Thread tuning
AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound. The default thread count (`os.cpu_count()`) causes massive OpenMP sync overhead on multi-core hosts. Add runtime auto-tuning that sweeps candidate thread counts via a quick microbenchmark and picks the fastest. CLI override via `--calibration_num_threads`. On a 72-vCPU host, auto-tune selects 18-36 threads, yielding 4.6x faster calibration (24 min vs 1h51m) with no PPL regression.
Calibration times for a few models
Llama3.2-1B PPL Validation
cc @cccclai @cbilgin @digantdesai @tanvirislam-meta